Abstract

Many corpus-based natural language processing systems rely on using large quantities of
annotated text as their training examples. Building this kind of resource is an expensive
and labor-intensive project. To minimize effort spent on annotating examples that are not
helpful to the training process, recent research efforts have begun to apply active
learning techniques to selectively choose data to be annotated. In this work, we consider
selecting training examples with the tree-entropy metric. Our goal is to assess how well
this selection technique can be applied for training different types of parsers. We find
that tree-entropy can significantly reduce the amount of training annotation for both a
history-based parser and an EM-based parser. Moreover, the examples selected for the
history-based parser are also good for training the EM-based parser, suggesting that the
technique is parser independent.

1  Introduction 

In recent years, large collections of text in machine-readable format have become
readily available. These ought to be valuable resources for training natural language
processing systems. Unfortunately, most systems cannot
take advantage of the data in their raw text form; typically, the data must be annotated
by a human to become effective training examples. For instance, consider the task of
inducing a grammar to parse English sentences. Studies have shown that a grammar trained
on sentences annotated with their constituent trees produces much better parses than one
trained on just the sentences alone (Pereira and Schabes, 1992). Recent state-of-the-art
parsers developed by Collins (1997) and Charniak (1999) are all trained from
hand-annotated corpora such as those from the Penn Treebank Project (Marcus et al., 1993).
However, building an annotated corpus is a labor-intensive project; therefore, it is
important to find ways to minimize the size of the corpus.

Out of a large pool of raw text, what subset should be annotated and added to the
training set? Recent studies have begun to address this question using sample selection,
in which the training process is seen as an interactive session between the learning
system and the human annotator (Lewis and Gale, 1994; Engelson and Dagan, 1996; Fujii et
al., 1998; Thompson et al., 1999; Ngai and Yarowsky, 2000). The system actively
influences its learning progress by evaluating potential candidates from the pool of raw
text and selecting those with high Training Utility Values (TUV) for humans to annotate.
As the learning process continues, the system should become better at identifying good
training candidates so that the annotators do not need to waste time on processing
uninformative examples.

This work considers the problem of applying sample selection techniques to the task of
training statistical parsers. Our primary challenge is in designing a function that can
accurately estimate an unlabeled candidate’s potential utility for training a parser. In
a previous study (Hwa, 2000b), we applied sample selection to an induction algorithm
based on the expectation-maximization (EM) principle that induces Probabilistic
Lexicalized Tree Insertion Grammars (PLTIGs). In that work, we proposed an
uncertainty-based evaluation function, called tree entropy, to estimate the TUV of
unlabeled candidates. We have empirically shown that sample selection with tree entropy
can reduce the size of the training corpus significantly. However, because only an
EM-based learner was used, it is unknown whether the evaluation function would be general
enough to be applicable to other types of learners. The goal of this work is to assess
the robustness of the tree-entropy evaluation function. We have performed experiments to
evaluate how well the metric selects training examples for different types of parsers and
to determine whether examples selected for one type of parser might be good for training
a different type of parser. Our experimental results show that the tree-entropy metric
can reduce the amount of training annotation by 23% for a history-based lexicalized
statistical parser, the Model 2 parser described by Collins (1997). Moreover, we found
that the data selected for training the Collins Parser also make good training examples
for inducing the EM-based PLTIG parser, suggesting that the tree-entropy evaluation
function is parser independent.

2  The  Learning  Framework 

There are two types of sample selection algorithms: committee-based and single-learner.
A committee-based selection algorithm works with multiple learners, each maintaining a
different hypothesis (perhaps pertaining to different aspects of the problem). The
candidate examples that lead to the most disagreements among the different learners are
considered to have the highest TUV (Cohn et al., 1994; Freund et al., 1997). For
computationally intensive problems, keeping multiple learners may be impractical. In this
work, we focus on sample selection algorithms that use only a single learner that keeps
just one working hypothesis. Without access to multiple hypotheses, the selection
algorithm can nonetheless estimate the TUV of an example. We categorize some possible
ranking criteria into the following three classes:

Problem-space: Knowledge about the problem-space may help to locate good training
candidates. For example, knowing the distribution of the pool, we might select the
most frequently occurring instances.

Performance of the hypothesis: Testing the candidates on the current hypothesis may
show the type of data on which the hypothesis performs weakly (Lewis and Catlett,
1994).

Parameters of the hypothesis: Estimating the potential impact that the candidates will
have on the parameters of the current working hypothesis locates those examples that
will change the current hypothesis the most.


    U is a set of unlabeled candidates.
    L is a set of labeled training examples.
    M is the current model.

    Initialize:
        M ← Train(L).
    Repeat
        N ← Select(n, U, M, f).
        U ← U − N.
        L ← L ∪ Label(N).
        M ← Train(L).
    Until (M ~ M_true) or (U = ∅) or (human stops).

Figure 1: The pseudo-code for the sample selection learning algorithm


Figure 1 outlines the single-learner sample selection training loop in pseudo-code.
Initially, the training set, L, consists of a small number of labeled examples. The
learner uses L to train an initial model M. Also available to the learner is a large pool
of unlabeled training candidates, U. In each iteration, the selection algorithm,
Select(n, U, M, f), uses an evaluation function f to compute the expected TUV of each
candidate in U and returns the n candidates with the highest values. The n chosen
candidates are then labeled by human experts and added to the existing training set.
Training on the updated set L, the system modifies the model so that it is consistent
with all the examples seen thus far. The loop continues until one of the stopping
conditions is met: the model is considered to be good enough, all candidates are labeled,
or all human resources are used up.
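
The loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names sample_selection, train, label, good_enough, and max_rounds are hypothetical, label stands in for the human annotator, good_enough stands in for the "model is good enough" test, and max_rounds stands in for the "human stops" condition.

```python
def sample_selection(labeled, unlabeled, train, label, f, n,
                     good_enough=None, max_rounds=None):
    """Single-learner sample selection loop (sketch of Figure 1).

    labeled   -- initial list of (example, annotation) pairs, the set L
    unlabeled -- list of unannotated candidates, the pool U
    train     -- function mapping a labeled list to a model M (assumed)
    label     -- function standing in for the human annotator (assumed)
    f         -- evaluation function f(u, M) estimating the TUV of u
    n         -- number of candidates annotated per iteration
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train(labeled)          # Initialize: M <- Train(L)
    rounds = 0
    while pool:                     # stop when U is empty
        if good_enough is not None and good_enough(model):
            break                   # stop when the model is judged good enough
        if max_rounds is not None and rounds >= max_rounds:
            break                   # stop when the annotation budget is spent
        # Select(n, U, M, f): rank candidates by estimated TUV, highest first.
        ranked = sorted(pool, key=lambda u: f(u, model), reverse=True)
        chosen, pool = ranked[:n], ranked[n:]             # U <- U - N
        labeled.extend((u, label(u)) for u in chosen)     # L <- L + Label(N)
        model = train(labeled)                            # M <- Train(L)
        rounds += 1
    return model, labeled
```

Re-ranking the whole pool in every iteration mirrors the description above, in which the evaluation function is applied to each candidate in U under the current model; for a large pool this is the dominant cost alongside retraining.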

2.1  The  Evaluation  Function 

At the heart of the sample selection algorithm is the evaluation function that predicts
each unlabeled candidate’s training utility. Our proposed function ranks candidates based
on the “performance of the hypothesis.” In other words, we wish to find the set of
sentences that the current parsing model is the most uncertain about. One way to measure
the parser’s uncertainty is to compute the tree entropy over the distribution of parsing
probabilities of the set of trees produced by the parser. More specifically, the tree
entropy for a sentence u is:


$$TE(u, M) = -\sum_{t \in T} \Pr(t \mid u, M) \log_2 \Pr(t \mid u, M),$$


where T is the set of possible trees that M generated for u. Details of computing tree
entropy have been discussed previously (Hwa, 2000b). Our proposed function evaluates each
candidate by measuring the similarity between the tree entropy of the candidate and the
entropy of the uniform distribution over the same number of trees. That is,


$$f_{te}(u, M) = \frac{TE(u, M)}{\log_2 |T|}.$$
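
As a minimal sketch of the two formulas above, the following Python computes the tree entropy and its normalized form from an explicit list of parse probabilities. Several things here are assumptions made for illustration: the function names are hypothetical, the trees are enumerated explicitly (feasible only when the set T is small, unlike the computation referred to in Hwa, 2000b), the input scores are renormalized to sum to one, and a sentence with a single parse is given a score of zero because the denominator vanishes.

```python
import math


def tree_entropy(tree_probs):
    """TE(u, M): entropy (in bits) of the distribution over the parses of u.

    tree_probs -- non-negative scores, one per tree in T; they are
                  renormalized here so that they sum to 1 (an assumption).
    """
    total = sum(tree_probs)
    if total <= 0:
        raise ValueError("at least one tree must have positive probability")
    entropy = 0.0
    for p in tree_probs:
        if p > 0:                       # 0 * log 0 is taken to be 0
            q = p / total
            entropy -= q * math.log2(q)
    return entropy


def f_te(tree_probs):
    """Tree entropy divided by log2 |T|, the entropy of a uniform distribution."""
    num_trees = len(tree_probs)
    if num_trees <= 1:
        return 0.0                      # a single parse carries no uncertainty
    return tree_entropy(tree_probs) / math.log2(num_trees)
```

Under these assumptions the score lies between 0 and 1: it approaches 1 when the parser spreads probability evenly over its candidate trees and approaches 0 when one tree dominates, so sentences of different lengths and ambiguity levels can be ranked on a common scale.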


2.2  Parsing  Models 

To test the robustness of the tree-entropy evaluation function, we use it to select
training examples for the Collins Parser and the PLTIG parser. Although both are
lexicalized and statistical parsers, their learning algorithms are different. The Collins
Parser is a fully-supervised, history-based learner that models the parameters of the
parser by taking statistics directly from the training data. In contrast, PLTIG’s
EM-based induction algorithm (Hwa, 2000a) is partially-supervised; the model’s parameters
are estimated indirectly from the training data. Our goal for this study is to determine
whether the success of the tree-entropy metric is learner dependent.

3  Experimental  Setup  and  Results 

Two experiments are performed. The first experiment assesses whether the tree-entropy
evaluation function can select good examples for a history-based learner. The second
experiment is a preliminary study on whether the examples selected for a history-based
learner are also good training examples for an EM-based learner.

3.1  Experiment  1 

We use the Collins Parser as the basic learning model M in the sample selection
framework described in Figure 1. To simulate the interactive process, we create a large
unlabeled candidate pool U by stripping all annotated information from sections 02
through 21 of the Wall Street Journal corpus. Initially, L, the set of labeled training
data, consists of 500 parsed sentences. In each iteration, n = 1000 new sentences are
picked from U to be added to L. Then, a new parser is trained from the updated L and
tested on section 00 to chart the learning progress.

We compare the learning rate of the parser trained on examples selected by the
tree-entropy evaluation function, f_te, with a baseline in which the model was trained
with examples sequentially selected. The experimental results are graphed in Figure 2(a).
The parsing performance on the test sentences (using the combined labeled precision and
labeled recall score^1 as the metric (Van Rijsbergen, 1979)) is graphed as a function of
the number of labeled constituents in the training data. We use the number of
constituents rather than the number of sentences because it is a better indicator of the
effort spent by the human annotator. Longer sentences tend to require more annotation
than short ones, and thus take more time to analyze.

Figure 2: (a) A graph comparing the learning rates of the Collins Parser under two training
conditions: “baseline” shows the progress of sequential training and “tree entropy” shows the
progress of sample selection training. (b) A graph showing the relative amounts of annotated
training data used to achieve the same performance level by the two evaluation functions.

^1 $F_{\beta=1} = \frac{2 \times LR \times LP}{LR + LP}$, where LR is the labeled recall
score and LP is the labeled precision score.

Our results suggest that the parser learns faster when trained from examples selected by
f_te. The learning rates of the parser under the two training conditions are plotted in
Figure 2(a). The graph shows that, for a comparable number of annotated constituents in
the training data, the parser trained on examples selected by f_te typically performs
better on unseen test data than the baseline. Another way of interpreting the results is
to say that the same parsing performance can be achieved using fewer annotated training
examples. Figure 2(b) graphs the amount of reduction in annotated training constituents
that f_te offers from the baseline given comparable parsing performances. For the final
parsing performance of 88.7%, the parser requires a baseline training set of 36,500
sentences annotated with about 675,000 constituents. In contrast, the same performance
can be achieved using a training set of 520,000 constituents in the 23,500 sentences
selected by f_te, reducing the number of annotated constituents by 23%.
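
As a worked check of the figures just quoted (simple arithmetic on the reported, approximate counts, not an additional result), the relative reduction in annotated constituents is

$$\frac{675{,}000 - 520{,}000}{675{,}000} = \frac{155{,}000}{675{,}000} \approx 0.23,$$

while the same comparison counted in sentences gives

$$\frac{36{,}500 - 23{,}500}{36{,}500} = \frac{13{,}000}{36{,}500} \approx 0.36.$$

The gap between the two ratios indicates that the selected sentences carry more constituents on average (roughly 22 per sentence versus roughly 18 for the baseline), which is why the constituent count is the more conservative measure of annotation effort.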

3.2  Experiment  2 

To determine the suitability of the selected training examples across different learners,
we now use PLTIG as the basic learning model and compare the parsing performances of
three PLTIGs induced from different sets of training sentences: those selected by the
tree-entropy evaluation function for a PLTIG model, f_te(u, M_PLTIG); those selected for
the Collins Parser in the previous experiment, f_te(u, M_Collins); and the baseline of
sequential selection. Some modifications to the experimental setup of the previous
experiment are necessary to accommodate the EM-based induction algorithm for PLTIG.
Because EM-based grammar induction is computationally expensive, this experiment is
limited to using an unlabeled pool of 3600 sentences, and the grammars are lexicalized to
part-of-speech tags rather than words. Moreover, because the algorithm induces grammars
that generate binary branching trees, we evaluate the parsing accuracy of the test
sentences with the consistent bracketing metric (i.e., the percentage of constituents in
the proposed parse not crossing constituents of the true parse) rather than the average
precision and recall metric^2.

Figure 3: (a) A graph comparing the learning rates of three PLTIGs induced from different
sets of training examples. (b) A graph showing the relative amounts of annotated training data
used to induce the PLTIGs.
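
The consistent bracketing metric used in this experiment can be illustrated with a short Python sketch. It assumes, for illustration only, that each constituent is represented as a half-open word span (start, end) and that a proposed parse with no constituents scores 1.0; the function names are hypothetical and this is not the evaluation code used in the experiment.

```python
def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def consistent_bracketing(proposed, gold):
    """Fraction of proposed constituents that cross no gold constituent.

    proposed, gold -- lists of (start, end) spans over word positions,
                      with end exclusive (an assumed representation).
    """
    if not proposed:
        return 1.0      # assumption: an empty parse crosses nothing
    consistent = sum(
        1 for p in proposed if not any(crosses(p, g) for g in gold)
    )
    return consistent / len(proposed)
```

For a four-word sentence with true constituents (0, 4), (0, 2), and (2, 4), the binary-branching proposal (0, 4), (1, 4), (2, 4) scores 2/3 because only (1, 4) crosses a true constituent. Against a completely flat true parse containing only (0, 4), the same proposal scores 1.0, which shows how the metric avoids penalizing extra binary brackets.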

Currently, one trial of this experiment has been completed. While a more comprehensive
analysis is required, our initial results suggest that the examples selected by the
tree-entropy metric are informative independent of the underlying learning model.
Comparing the learning rates of the three PLTIGs graphed in Figure 3(a), we see that
although the learning rate of the grammar trained on examples selected specifically for
the PLTIG model is faster than that of the one trained on examples selected for the
Collins Parser, both are better than the baseline. The bar graph in Figure 3(b) shows the
relative amounts of annotation used to train each grammar. To achieve a parsing level
comparable to the baseline’s best performance, the grammar trained on examples selected
for the Collins Parser needed about 15% less annotation than the baseline, and the
grammar trained on examples selected for itself needed about 33%^3 less annotation than
the baseline. Both induced grammars achieved slightly higher parsing accuracy than the
baseline when trained on all examples.

^2 The number of proposed constituents in a binary branching tree is always one fewer than
the length of the sentence. The WSJ corpus, on the other hand, favors a more flattened
tree structure with considerably fewer brackets per sentence. The consistent bracketing
metric does not unfairly penalize a proposed parse tree for being binary branching.

^3 The figure reported in our previous study of 36%

4  Conclusion  and  Future  Work 

In this paper, we have assessed the robustness of the tree-entropy evaluation function as
a metric of training utility values for different types of parsers. We have empirically
shown that tree-entropy can select informative training examples and reduce the amount of
training annotation for both a history-based learner and an EM-based learner. Moreover,
we have found that the training examples selected for the history-based parser are also
informative for training the EM-based parser.

In addition to the tree-entropy evaluation function, which uses the performance of the
hypothesis as the ranking criterion, we are exploring alternative evaluation functions
that use problem-space based and parameter-confidence based ranking criteria.